R at scale on the Google Cloud Platform

Mark Edmondson (@HoloMarkeD)

May 20th, 2019 - CopenhagenR

code.markedmondson.me

Qualifications

  • Digital agencies since 2007
  • useR since 2012 - Motive: how to use all this web data?
  • Shiny enthusiast e.g. https://gallery.shinyapps.io/ga-effect/
  • Google Developer Expert - Google Analytics & Google Cloud
  • Several Google API themed packages on CRAN via googleAuthR
  • Part of cloudyr group (AWS/Azure/GCP R packages for the cloud) https://cloudyr.github.io/
  • Now: Data Engineer @ IIH Nordic

googleAuthRverse

  • searchConsoleR
  • googleAuthR
  • googleAnalyticsR
  • googleComputeEngineR (Cloudyr)
  • bigQueryR (Cloudyr)
  • googleCloudStorageR (Cloudyr)
  • googleLanguageR (rOpenSci)

Slack group to discuss the packages: #googleAuthRverse

  • googleCloudVisionR
  • googleKubernetesR

I thought I knew a bit about R and Google Cloud but then…

GoogleNext19 - Data Science at Scale with R on GCP

A 40-minute talk at Google Next19 with lots of new things to try!

https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be

New concepts

A great video that goes deeper into things I haven't tried yet: Spark clusters, Jupyter notebooks, training with ML Engine, and scaling with Seldon on Kubernetes

Some shots from the video

It (almost) always starts with Docker

Dockerfiles from The Rocker Project

https://www.rocker-project.org/

Dockerfiles

FROM rocker/tidyverse:3.6.0
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

# install packages from CRAN, then from GitHub, then clean up
RUN install2.r --error \
    -r 'https://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

Docker + R = R in Production

  • Flexible: no need to ask IT to install R everywhere, just use docker run. Cross-cloud, future-proof(?)

  • Version controlled: no worries that the latest tidyverse update will break your code

  • Scalable: run multiple Docker containers at once; fits into an event-driven, stateless, serverless future

Creating Docker images with Cloud Build

Continuous development with GitHub pushes

R Applications

How can we scale an R Script?

  • Vertical scaling - increase the size and power of one machine
  • Horizontal scaling - split up your problem into lots of little machines
  • Serverless scaling - send your code + data to the cloud and let the provider work out how many machines are needed

Vertical scaling

Bigger boat

Bigger VMs

Pros

Probably run the same code with no changes needed
Easy to set up

Cons

Expensive
May be better to have data in database

Launching a monster VM in the cloud

3.75TB of RAM: $423 a day
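
A rough sketch of what launching such a machine could look like with googleComputeEngineR, plus the arithmetic behind the daily price. The VM name, the credentials and the machine type n1-ultramem-160 are assumptions for illustration (the slide only states 3.75TB of RAM and $423 a day), and both helper functions are hypothetical wrappers. Nothing is launched unless you call launch_big_vm() yourself, with the package installed and authenticated.

```r
# Daily price quoted on the slide (USD); the derived rates are plain arithmetic
daily_cost_usd   <- 423
hourly_cost_usd  <- daily_cost_usd / 24   # cost per hour of uptime
monthly_cost_usd <- daily_cost_usd * 30   # if left running for 30 days

# Hypothetical helper: start an RStudio Server VM on a very large machine type.
# Assumes googleComputeEngineR is installed and authenticated for your project.
launch_big_vm <- function(name = "big-mem-rstudio",
                          username = "rstudio-user",
                          password = "change-me",
                          machine_type = "n1-ultramem-160") {  # assumed type
  googleComputeEngineR::gce_vm(
    name,
    template = "rstudio",
    username = username,
    password = password,
    predefined_type = machine_type
  )
}

# Hypothetical helper: stop the VM as soon as the job is done
stop_big_vm <- function(vm) {
  googleComputeEngineR::gce_vm_stop(vm)
}
```

Because billing follows uptime, a machine like this is usually started for a single job and stopped straight afterwards rather than left running.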

Horizontal scaling

Lots of little machines can accomplish great things

Parallelise your code

Pros

Docker infrastructure
library(future)

Cons

Changes to your code for split-map-reduce
Write meta code to handle I/O data and code
Not applicable to some problems

Adopt a split-map-reduce mindset

  • Break problems down into stateless lumps
  • Reusable bricks that can be applied to other tasks
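
A minimal base R sketch of the split-map-reduce idea, assuming a made-up data set of daily sessions for three sites (the data and the function name summarise_chunk are illustrative only):

```r
# Toy data: daily sessions for three hypothetical websites
set.seed(1)
web_data <- data.frame(
  site = rep(c("a", "b", "c"), each = 10),
  sessions = rpois(30, lambda = 100)
)

# Split: one stateless lump per site
chunks <- split(web_data, web_data$site)

# Map: a reusable brick that depends only on its input
summarise_chunk <- function(df) {
  data.frame(site = df$site[1], mean_sessions = mean(df$sessions))
}
mapped <- lapply(chunks, summarise_chunk)

# Reduce: combine the per-chunk results into one table
result <- Reduce(rbind, mapped)
```

Since summarise_chunk() needs nothing but its own chunk, the lapply() step is the part that could later be handed to a parallel backend, for example future.apply::future_lapply().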

Set up a cluster

New in googleComputeEngineR v0.3 (7th May)

library(future)

googleComputeEngineR has custom backend for future
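
A sketch of how the future backend might be wired up, following the pattern in the googleComputeEngineR documentation (gce_vm_cluster() plus plan(cluster, ...)). The wrapper run_on_gce_cluster() and its arguments are hypothetical; it assumes the package is authenticated against a project with a default zone, and no VMs are created until it is called.

```r
library(future)
library(googleComputeEngineR)

# Hypothetical wrapper: nothing is created until you call it
run_on_gce_cluster <- function(inputs, fun, cluster_size = 3) {
  # start the VMs, using the package defaults for everything else
  vms <- gce_vm_cluster(cluster_size = cluster_size)

  # reset the plan and stop the VMs even if the work below fails
  on.exit({
    plan(sequential)
    lapply(vms, gce_vm_stop)
  }, add = TRUE)

  # register the VMs as the backend for future
  plan(cluster, workers = as.cluster(vms))

  # one future per input, resolved on whichever VM is free
  futures <- lapply(inputs, function(x) future(fun(x)))
  lapply(futures, value)
}
```

A call such as run_on_gce_cluster(list(1, 2, 3), sqrt) would then evaluate each element on the cluster rather than locally.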

Forecasting example

Multi-layer future loops

future loops can be layered: one layer across the VMs, a second layer across the CPUs within each VM

Thanks to Grant McDermott for figuring out the optimal method (Issue #129)
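
One way to express two layers is a nested plan from the future package, sketched below. This is a generic pattern and not necessarily the method worked out in Issue #129. The helper names are hypothetical, vms is assumed to come from gce_vm_cluster(), and how many cores the inner layer picks up can depend on the future version, so it is worth checking availableCores() on a worker.

```r
library(future)
library(googleComputeEngineR)

# Hypothetical helper: `vms` is assumed to come from gce_vm_cluster()
plan_two_layers <- function(vms) {
  plan(list(
    tweak(cluster, workers = as.cluster(vms)),  # layer 1: one worker per VM
    multisession                                # layer 2: cores within each VM
  ))
}

# Hypothetical nested loop: outer futures go to VMs, inner futures to their CPUs
nested_apply <- function(groups, fun) {
  outer <- lapply(groups, function(group) {
    future({
      inner <- lapply(group, function(x) future::future(fun(x)))
      lapply(inner, future::value)
    })
  })
  lapply(outer, value)
}
```

With 3 VMs of 8 CPUs each, groups would ideally hold 3 elements of roughly 8 tasks each so that all 24 threads have work.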

CPU utilization

3 VMs, 8 CPUs each = 24 threads

Serverless scaling

Kubernetes

Clusters of VMs + Docker + Task controller = Kubernetes

Kubernetes

Pros

Auto-scaling, task queues etc.
Scale to billions
Potentially cheaper
May already have cluster 

Cons

Needs stateless, idempotent workflows
Message broker?
Minimum 3 VMs

Scaling Shiny and R APIs

Dockerfiles for Shiny apps

Built on Cloud Build upon GitHub push:

FROM rocker/shiny
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

# install packages from CRAN needed for your app
RUN install2.r --error \
    -r 'https://cran.rstudio.com' \
    googleAuthR \
    googleAnalyticsR

# assumes the Shiny app is in the build folder ./shiny
COPY ./shiny/ /srv/shiny-server/myapp/

Dockerfiles for plumber APIs

Built on Cloud Build on every GitHub push:

FROM trestletech/plumber

# copy your plumbed R script into the image root
COPY api.R /api.R

# the image's entrypoint plumbs and serves the script passed here
CMD ["/api.R"]
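
For illustration, a minimal api.R that such an image could serve might look like the sketch below. The /echo endpoint mirrors the curl call shown later in the deck; the helper name echo_message is an assumption, and the /api prefix in that curl call is assumed to come from routing in front of the container rather than from this script.

```r
# api.R: a hypothetical plumbed script for the Dockerfile above

# Plain helper so the logic can be checked without starting a server
echo_message <- function(msg = "") {
  paste0("The message is: ", msg)
}

#* Echo back the supplied message
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  echo_message(msg)
}
```

Keeping the logic in an ordinary function means it can be unit tested with plain R, while the annotated function stays a thin HTTP wrapper.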

Standard VM

Kubernetes

Shiny App:

kubectl run shiny1 --image gcr.io/gcer-public/shiny-googleauthrdemo:latest --port 3838
kubectl expose deployment shiny1 --target-port=3838  --type=NodePort

R plumber API:

kubectl run my-plumber --image gcr.io/your-project/my-plumber --port 8000
kubectl expose deployment my-plumber --target-port=8000  --type=NodePort

Shiny apps waiting for service

Expose your workloads via Ingress

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: r-ingress-nginx
spec:
  rules:
  - http:
      paths:
      - path: /gar/
      # app deployed to /gar/shiny/
        backend:
          serviceName: shiny1
          servicePort: 3838

Apps available at URL on demand

curl 'http://mydomain/api/echo?msg="its alive!"'
#> "The message is: its alive!"

The future?